Data Clean-up and Manipulation
This notebook starts by cleaning and manipulating both the game data and the team statistics data frames. This takes a little extra work because my initial proposal did not yet include team statistics. The EDA below is extended with the new inputs that come from looking at team stats across the seasons.
game_id is kept as an index and reference point for all of the data. The rows for indoor stadiums needed to be adjusted because both temp and wind are missing in these environments. After significant research I settled on a wind value of 1 for dome stadiums and 3 for closed retractable roof stadiums. I then held temperature constant for indoor stadiums: 70.0 (room temperature) for domes, and the mean temperature of the data set for closed retractable roofs.
The game data and the team data cover the same seasons except for 2023, so this analysis is only current through the end of the 2022 season.
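A minimal sketch of how the season overlap between the two data sets could be checked. The helper name shared_seasons and the toy frames are assumptions for illustration only; they are not the real data.

```python
import pandas as pd

def shared_seasons(games: pd.DataFrame, stats: pd.DataFrame, col: str = "season") -> list:
    """Return the sorted seasons that appear in both frames (hypothetical helper)."""
    common = set(games[col].dropna()) & set(stats[col].dropna())
    return sorted(int(s) for s in common)

# Toy frames standing in for the game and team data (illustrative values only).
toy_games = pd.DataFrame({"season": [2021, 2022, 2023]})
toy_stats = pd.DataFrame({"season": [2020, 2021, 2022]})

overlap = shared_seasons(toy_games, toy_stats)
print(overlap)
```

Seasons outside the overlap end up with missing values after the merge and are removed by the later dropna call.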
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from patsy import dmatrices
team_stats = pd.read_csv('nfl-team-statistics.csv')
team_stats_copy = team_stats.copy()
team_stats_copy['home_team']= team_stats['team']
team_stats_copy['away_team']= team_stats['team']
stats_list = ['home_team', 'season', 'offense_completion_percentage', 'offense_ave_yards_gained_pass', 'offense_ave_yards_gained_run', 'defense_ave_yards_gained_pass', 'defense_ave_yards_gained_run', 'points_allowed']
trim_stats = team_stats_copy[stats_list]
nfl_raw ='https://raw.githubusercontent.com/nflverse/nfldata/master/data/games.csv'
nfl_game_data = pd.read_csv(nfl_raw)
nfl_copy = nfl_game_data.copy()
columns_list = ['game_id', 'season', 'total', 'overtime', 'away_team', 'home_team', 'total_line', 'roof', 'surface', 'temp', 'wind']
trim_nfl = nfl_copy[columns_list]
nfl_df = trim_nfl.copy()
nfl_df['point_total_reached'] = np.where(trim_nfl.total - trim_nfl.total_line >= 0, 1,0)
nfl_df_copy = nfl_df.copy()
# Assign the result back; in-place fillna on a column accessor is unreliable in recent pandas
nfl_df_copy['surface'] = nfl_df_copy['surface'].fillna('UNKNOWN')
nfl_df_copy['opponent'] = nfl_df_copy.loc[:, 'away_team']
ready_nfl_df = nfl_df_copy.drop(columns = 'away_team')
# Map relocated franchises to their current abbreviations
relocations = {'OAK': 'LV', 'SD': 'LAC', 'STL': 'LA'}
ready_nfl_df[['home_team', 'opponent']] = ready_nfl_df[['home_team', 'opponent']].replace(relocations)
ready_nfl_df['overtime'] = np.where(ready_nfl_df.overtime.values == 1, 'YES', 'NO')
nfl_merged_df = pd.merge(ready_nfl_df, trim_stats, how = 'outer', on = ['season', 'home_team'])
nfl_merged_df['wind'] = np.where(nfl_merged_df['roof'] == 'closed', 3, nfl_merged_df['wind'])
nfl_merged_df['wind'] = np.where(nfl_merged_df['roof'] == 'dome', 1, nfl_merged_df['wind'])
nfl_merged_df['temp'] = np.where(nfl_merged_df['roof'] == 'closed', nfl_merged_df.temp.mean(), nfl_merged_df['temp'])
nfl_merged_df['temp'] = np.where(nfl_merged_df['roof'] == 'dome', 70.0, nfl_merged_df['temp'])
nfl_merged_df.dropna(inplace = True)
nfl_merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6215 entries, 0 to 6420
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   game_id                        6215 non-null   object
 1   season                         6215 non-null   int64
 2   total                          6215 non-null   int64
 3   overtime                       6215 non-null   object
 4   home_team                      6215 non-null   object
 5   total_line                     6215 non-null   float64
 6   roof                           6215 non-null   object
 7   surface                        6215 non-null   object
 8   temp                           6215 non-null   float64
 9   wind                           6215 non-null   float64
 10  point_total_reached            6215 non-null   int64
 11  opponent                       6215 non-null   object
 12  offense_completion_percentage  6215 non-null   float64
 13  offense_ave_yards_gained_pass  6215 non-null   float64
 14  offense_ave_yards_gained_run   6215 non-null   float64
 15  defense_ave_yards_gained_pass  6215 non-null   float64
 16  defense_ave_yards_gained_run   6215 non-null   float64
 17  points_allowed                 6215 non-null   float64
dtypes: float64(9), int64(3), object(6)
memory usage: 922.5+ KB
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_pass', hue = 'point_total_reached', kind = 'scatter')
plt.show()
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_pass', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')
plt.show()
sns.relplot(data = nfl_merged_df, x = 'points_allowed', y = 'offense_ave_yards_gained_run', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')
plt.show()
sns.relplot(data = nfl_merged_df, x = 'temp', y = 'wind', hue = 'home_team', col = 'point_total_reached', kind = 'scatter')
plt.show()
sns.relplot(data = nfl_merged_df, x = 'defense_ave_yards_gained_pass', y = 'offense_ave_yards_gained_run', hue = 'point_total_reached', col = 'home_team', col_wrap = 4, kind = 'scatter')
plt.show()
sns.catplot(data = nfl_merged_df, y= 'home_team', hue = 'point_total_reached', kind = 'count')
plt.show()
sns.catplot(data = nfl_merged_df, x = 'overtime', hue = 'point_total_reached', kind = 'count')
plt.show()
sns.pairplot(data = nfl_merged_df, hue = 'point_total_reached')
plt.show()
EDA Summary
When performing EDA, I found no real correlation between the inputs and the output, apart from the two variables used to create it (total and total_line). One likely reason is that Vegas performs modeling on a completely different level. The lines could already factor in all of these inputs and many more, which would explain the near perfect split of games reaching their point total across 20+ seasons. The only factor that showed any sort of skew was whether a game went into overtime, which provides additional game time for potential scoring. It was also apparent that these variables need to be standardized. The distributions of the continuous inputs appeared roughly Gaussian, so no transformations were truly necessary.
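The two checks described above can be sketched roughly as follows: the share of games reaching the total (overall and by overtime) and a z-score standardization of the continuous inputs. The helper names over_rate and standardize and the toy frame are assumptions for illustration; the column names follow the merged data frame.

```python
import pandas as pd

def over_rate(df, by=None, target="point_total_reached"):
    """Share of games that reached the total, overall or per group (hypothetical helper)."""
    if by is None:
        return float(df[target].mean())
    return df.groupby(by)[target].mean()

def standardize(df, cols):
    """Z-score the given columns using the population std (ddof=0),
    which is the same scaling StandardScaler applies by default."""
    out = df.copy()
    out[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return out

# Toy frame with illustrative values only, not real game data.
toy = pd.DataFrame({
    "overtime": ["NO", "NO", "YES", "YES"],
    "point_total_reached": [0, 1, 1, 1],
    "temp": [40.0, 60.0, 70.0, 70.0],
    "wind": [10.0, 5.0, 1.0, 4.0],
})

overall = over_rate(toy)
by_overtime = over_rate(toy, by="overtime")
scaled = standardize(toy, ["temp", "wind"])
print(overall)
print(by_overtime)
print(scaled[["temp", "wind"]])
```

On the real data, an overall rate close to 0.5 combined with a visibly higher rate for overtime games would match the observations in this summary.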
With that in mind, when I took these results to modeling I considered altering my classification threshold and assessing performance through a different lens. If a model can correctly classify event/non-event even slightly more than 50 percent of the time, it can be viewed as accurate in comparison to the lines set. At this point I know there are far more complicated factors to include if truly seeking to gain any edge, but this was an interesting starting point.
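A minimal sketch of that evaluation idea: score predicted probabilities at an adjustable threshold and compare the accuracy against a naive baseline. The function names and the toy probabilities are assumptions for illustration, not output from any fitted model.

```python
import numpy as np

def accuracy_at_threshold(probs, y_true, threshold=0.5):
    """Classify as event when prob >= threshold and return the accuracy."""
    preds = (np.asarray(probs) >= threshold).astype(int)
    return float((preds == np.asarray(y_true)).mean())

def majority_baseline(y_true):
    """Accuracy of always predicting the more common class."""
    rate = float(np.mean(y_true))
    return max(rate, 1.0 - rate)

# Toy predicted probabilities and outcomes (illustrative values only).
toy_probs = [0.2, 0.45, 0.55, 0.9]
toy_y = [0, 1, 1, 1]

acc_default = accuracy_at_threshold(toy_probs, toy_y)
acc_lowered = accuracy_at_threshold(toy_probs, toy_y, threshold=0.4)
baseline = majority_baseline(toy_y)
print(acc_default, acc_lowered, baseline)
```

With outcomes split close to evenly, the majority baseline sits near 0.5, which is why even a small margin above it is the relevant bar here.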